Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] add DELETED file update task #144494

Merged
merged 1 commit into from
Nov 3, 2022

Conversation

joeypoon
Copy link
Member

@joeypoon joeypoon commented Nov 3, 2022

Summary

Adds Kibana task for updating file status to DELETED when there are no file chunks.
See #143459 (comment).

Checklist

For maintainers

@joeypoon joeypoon added release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team Team:Defend Workflows “EDR Workflows” sub-team of Security Solution v8.6.0 labels Nov 3, 2022
@joeypoon joeypoon requested review from pzl and paul-tavares November 3, 2022 03:34
@joeypoon joeypoon force-pushed the feature/fleet-deleted-file-task branch from 3f1d1d1 to e1e02a0 Compare November 3, 2022 04:52
@joeypoon joeypoon marked this pull request as ready for review November 3, 2022 04:54
@joeypoon joeypoon requested review from a team as code owners November 3, 2022 04:54
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-onboarding-and-lifecycle-mgt (Team:Onboarding and Lifecycle Mgt)

@joeypoon joeypoon force-pushed the feature/fleet-deleted-file-task branch from 7fc9603 to a442cd5 Compare November 3, 2022 05:16
Copy link
Contributor

@juliaElastic juliaElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@pzl pzl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

hits: {
hits: [
{
_index: ENDPOINT_FILE_METADATA_INDEX,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

silly nitpick, but since we're in fleet-land now, should we change these references to be "about" agent instead of endpoint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh I just don't know the index names 😅. might be bad search-fu on my end but I couldn't spot any references to other .fleet-*-files type indices. happy to update if anyone can point me in the right direction.

Copy link
Contributor

@juliaElastic juliaElastic Nov 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have used .fleet-agent-files in my pr, probably doesn't matter too much for the unit test.

@joeypoon joeypoon force-pushed the feature/fleet-deleted-file-task branch from a442cd5 to 73845e5 Compare November 3, 2022 13:34
@joeypoon joeypoon force-pushed the feature/fleet-deleted-file-task branch 2 times, most recently from 74dbbff to a23d1de Compare November 3, 2022 16:15
Adds Kibana task for updating file status to DELETED when there are no
file chunks.
@joeypoon joeypoon force-pushed the feature/fleet-deleted-file-task branch from a23d1de to 585f4cd Compare November 3, 2022 18:22
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Response ops changes LGTM. Reviewed task type definition only

@joeypoon joeypoon enabled auto-merge (squash) November 3, 2022 18:27
@joeypoon joeypoon merged commit e257e8a into elastic:main Nov 3, 2022
@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled in files

id before after diff
osquery 1 2 +1

ESLint disabled line counts

id before after diff
enterpriseSearch 19 21 +2
fleet 58 64 +6
osquery 108 113 +5
securitySolution 440 446 +6
total +19

Total ESLint disabled count

id before after diff
enterpriseSearch 20 22 +2
fleet 66 72 +6
osquery 109 115 +6
securitySolution 517 523 +6
total +20

History

  • 💔 Build #85193 failed a23d1de9740db4455c42f0a59e6d76718172274e
  • 💔 Build #85107 failed 73845e51eb083e678eb582d44a0d297c3de83388
  • 💔 Build #84957 failed a442cd5e81d77736c5c2d618e9cd990ae1df49d0
  • 💔 Build #84955 failed 7fc9603eec7959373720f4c9430839cc02058c0e
  • 💔 Build #84949 failed 3f1d1d115ba56654bcb073ee611a88f556d8c753

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Nov 3, 2022
Copy link
Contributor

@paul-tavares paul-tavares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took me a bit of time to review. Left some comments which I don't think are critical.

@@ -194,3 +194,7 @@ on_failure:
field: error.message
value:
- 'failed in Fleet agent final_pipeline: {{ _ingest.on_failure_message }}'`;

// File storage indexes supporting endpoint Upload/download
export const FILE_STORAGE_METADATA_INDEX = '.fleet-*-files';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: name the const with the word PATTERN since its matching on wildcards.

* 2.0.
*/

export type FILE_STATUS = 'AWAITING_UPLOAD' | 'UPLOADING' | 'READY' | 'UPLOAD_ERROR' | 'DELETED';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using the FIleStatus type from the Files plugin here instead of hard-coding it:

import type { FileStatus } from '@kbn/files-plugin/common/types';

Ref:

/**
* Status of a file.
*
* AWAITING_UPLOAD - A file object has been created but does not have any contents.
* UPLOADING - File contents are being uploaded.
* READY - File contents have been uploaded and are ready for to be downloaded.
* UPLOAD_ERROR - An attempt was made to upload file contents but failed.
* DELETED - The file contents have been or are being deleted.
*/
export type FileStatus = 'AWAITING_UPLOAD' | 'UPLOADING' | 'READY' | 'UPLOAD_ERROR' | 'DELETED';

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally using this but moved away since I didn't want to add a plugin dependency just for this. If that isn't a big deal, can add it back in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will need a plugin dependency for the request diagnostics feature anyway

@joeypoon joeypoon deleted the feature/fleet-deleted-file-task branch November 4, 2022 14:47
kibanamachine added a commit that referenced this pull request Nov 23, 2022
# Backport

This will backport the following commits from `main` to `8.6`:
- [Fleet Usage telemetry extension
(#145353)](#145353)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Julia
Bardi","email":"[email protected]"},"sourceCommit":{"committedDate":"2022-11-23T09:22:20Z","message":"Fleet
Usage telemetry extension (#145353)\n\n## Summary\r\n\r\nCloses
https://github.com/elastic/ingest-dev/issues/1261\r\n\r\nAdded a snippet
to the telemetry that I added for each requirement.\r\nPlease review and
let me know if any changes are needed.\r\nAlso asked a few questions
below. @jlind23 @kpollich \r\n\r\n6. is blocked by
[elasticsearch\r\nchange](elastic/elasticsearch#91701)
to give\r\nkibana_system the missing privilege to read
logs-elastic_agent* indices.\r\n\r\nTook inspiration for task versioning
from\r\nhttps://github.com//pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186\r\n\r\n-
[x] 1. Elastic Agent versions\r\nVersions of all the Elastic Agent
running: `agent.version` field on\r\n`.fleet-agents`
documents\r\n\r\n```\r\n\"agent_versions\": [\r\n \"8.6.0\"\r\n
],\r\n```\r\n\r\n- [x] 2. Fleet server configuration\r\nThink we can
query for `.fleet-policies` where some `input` has
`type:\r\n'fleet-server'` for this, as well as use the `Fleet Server
Hosts`\r\nsettings that we define via saved objects in
Fleet\r\n\r\n\r\n```\r\n \"fleet_server_config\": {\r\n \"policies\":
[\r\n {\r\n \"input_config\": {\r\n \"server\": {\r\n
\"limits.max_agents\": 10000\r\n },\r\n \"server.runtime\":
\"gc_percent:20\"\r\n }\r\n }\r\n ]\r\n }\r\n```\r\n\r\n- [x] 3. Number
of policies\r\nCount of `.fleet-policies` index \r\n\r\nTo confirm, did
we mean agent policies here?\r\n\r\n```\r\n \"agent_policies\": {\r\n
\"count\": 7,\r\n```\r\n\r\n- [x] 4. Output type contained in those
policies\r\nCollecting this from ts logic, querying from
`.fleet-policies` index.\r\nThe alternative would be to write a painless
script (because the\r\n`outputs` are an object with dynamic keys, we
can't do an aggregation\r\ndirectly).\r\n\r\n```\r\n\"agent_policies\":
{\r\n \"output_types\": [\r\n \"elasticsearch\"\r\n ]\r\n
}\r\n```\r\n\r\nDid we mean to just collect the types here, or any other
info? e.g.\r\noutput urls\r\n\r\n- [x] 5. Average number of checkin
failures\r\nWe only have the most recent checkin status and timestamp
on\r\n`.fleet-agents`.\r\n\r\nDo we mean here to publish the total last
checkin failure count? E.g. 3\r\nif 3 agents are in failure checkin
status currently.\r\nOr do we mean to publish specific info for all
agents\r\n(`last_checkin_status`, `last_checkin` time,
`last_checkin_message`)?\r\nAre the only statuses `error` and `degraded`
that we want to send?\r\n\r\n```\r\n \"agent_last_checkin_status\":
{\r\n \"error\": 0,\r\n \"degraded\": 0\r\n },\r\n```\r\n\r\n- [ ] 6.
Top 3 most common errors in the Elastic Agent logs\r\n\r\nDo we mean
here elastic-agent logs only, or fleet-server logs as well\r\n(maybe
separately)?\r\n\r\nI found an alternative way to query the message
field using sampler and\r\ncategorize text aggregation:\r\n```\r\nGET
logs-elastic_agent*/_search\r\n{\r\n \"size\": 0,\r\n \"query\": {\r\n
\"bool\": {\r\n \"must\": [\r\n {\r\n \"term\": {\r\n \"log.level\":
\"error\"\r\n }\r\n },\r\n {\r\n \"range\": {\r\n \"@timestamp\": {\r\n
\"gte\": \"now-1h\"\r\n }\r\n }\r\n }\r\n ]\r\n }\r\n },\r\n
\"aggregations\": {\r\n \"message_sample\": {\r\n \"sampler\": {\r\n
\"shard_size\": 200\r\n },\r\n \"aggs\": {\r\n \"categories\": {\r\n
\"categorize_text\": {\r\n \"field\": \"message\",\r\n \"size\": 10\r\n
}\r\n }\r\n }\r\n }\r\n }\r\n}\r\n```\r\nExample
response:\r\n```\r\n\"aggregations\": {\r\n \"message_sample\": {\r\n
\"doc_count\": 112,\r\n \"categories\": {\r\n \"buckets\": [\r\n {\r\n
\"doc_count\": 73,\r\n \"key\": \"failed to unenroll offline
agents\",\r\n \"regex\":
\".*?failed.+?to.+?unenroll.+?offline.+?agents.*?\",\r\n
\"max_matching_length\": 36\r\n },\r\n {\r\n \"doc_count\": 7,\r\n
\"key\": \"\"\"stderr panic close of closed channel n ngoroutine running
Stop ngithub.com/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5
\\n\\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go
n\r\n```\r\n\r\n\r\n- [x] 7. Number of checkin failure over the past
period of time\r\n\r\nI think this is almost the same as #5. The
difference would be to report\r\nnew failures happened only in the last
hour, or report all agents in\r\nfailure state. (which would be an
increasing number if the agent stays\r\nin failed state).\r\nDo we want
these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<[email protected]>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9","branchLabelMapping":{"^v8.7.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Team:Fleet","v8.6.0","v8.7.0"],"number":145353,"url":"https://github.com/elastic/kibana/pull/145353","mergeCommit":{"message":"Fleet
Usage telemetry extension (#145353)\n\n## Summary\r\n\r\nCloses
https://github.com/elastic/ingest-dev/issues/1261\r\n\r\nAdded a snippet
to the telemetry that I added for each requirement.\r\nPlease review and
let me know if any changes are needed.\r\nAlso asked a few questions
below. @jlind23 @kpollich \r\n\r\n6. is blocked by
[elasticsearch\r\nchange](elastic/elasticsearch#91701)
to give\r\nkibana_system the missing privilege to read
logs-elastic_agent* indices.\r\n\r\nTook inspiration for task versioning
from\r\nhttps://github.com//pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186\r\n\r\n-
[x] 1. Elastic Agent versions\r\nVersions of all the Elastic Agent
running: `agent.version` field on\r\n`.fleet-agents`
documents\r\n\r\n```\r\n\"agent_versions\": [\r\n \"8.6.0\"\r\n
],\r\n```\r\n\r\n- [x] 2. Fleet server configuration\r\nThink we can
query for `.fleet-policies` where some `input` has
`type:\r\n'fleet-server'` for this, as well as use the `Fleet Server
Hosts`\r\nsettings that we define via saved objects in
Fleet\r\n\r\n\r\n```\r\n \"fleet_server_config\": {\r\n \"policies\":
[\r\n {\r\n \"input_config\": {\r\n \"server\": {\r\n
\"limits.max_agents\": 10000\r\n },\r\n \"server.runtime\":
\"gc_percent:20\"\r\n }\r\n }\r\n ]\r\n }\r\n```\r\n\r\n- [x] 3. Number
of policies\r\nCount of `.fleet-policies` index \r\n\r\nTo confirm, did
we mean agent policies here?\r\n\r\n```\r\n \"agent_policies\": {\r\n
\"count\": 7,\r\n```\r\n\r\n- [x] 4. Output type contained in those
policies\r\nCollecting this from ts logic, querying from
`.fleet-policies` index.\r\nThe alternative would be to write a painless
script (because the\r\n`outputs` are an object with dynamic keys, we
can't do an aggregation\r\ndirectly).\r\n\r\n```\r\n\"agent_policies\":
{\r\n \"output_types\": [\r\n \"elasticsearch\"\r\n ]\r\n
}\r\n```\r\n\r\nDid we mean to just collect the types here, or any other
info? e.g.\r\noutput urls\r\n\r\n- [x] 5. Average number of checkin
failures\r\nWe only have the most recent checkin status and timestamp
on\r\n`.fleet-agents`.\r\n\r\nDo we mean here to publish the total last
checkin failure count? E.g. 3\r\nif 3 agents are in failure checkin
status currently.\r\nOr do we mean to publish specific info for all
agents\r\n(`last_checkin_status`, `last_checkin` time,
`last_checkin_message`)?\r\nAre the only statuses `error` and `degraded`
that we want to send?\r\n\r\n```\r\n \"agent_last_checkin_status\":
{\r\n \"error\": 0,\r\n \"degraded\": 0\r\n },\r\n```\r\n\r\n- [ ] 6.
Top 3 most common errors in the Elastic Agent logs\r\n\r\nDo we mean
here elastic-agent logs only, or fleet-server logs as well\r\n(maybe
separately)?\r\n\r\nI found an alternative way to query the message
field using sampler and\r\ncategorize text aggregation:\r\n```\r\nGET
logs-elastic_agent*/_search\r\n{\r\n \"size\": 0,\r\n \"query\": {\r\n
\"bool\": {\r\n \"must\": [\r\n {\r\n \"term\": {\r\n \"log.level\":
\"error\"\r\n }\r\n },\r\n {\r\n \"range\": {\r\n \"@timestamp\": {\r\n
\"gte\": \"now-1h\"\r\n }\r\n }\r\n }\r\n ]\r\n }\r\n },\r\n
\"aggregations\": {\r\n \"message_sample\": {\r\n \"sampler\": {\r\n
\"shard_size\": 200\r\n },\r\n \"aggs\": {\r\n \"categories\": {\r\n
\"categorize_text\": {\r\n \"field\": \"message\",\r\n \"size\": 10\r\n
}\r\n }\r\n }\r\n }\r\n }\r\n}\r\n```\r\nExample
response:\r\n```\r\n\"aggregations\": {\r\n \"message_sample\": {\r\n
\"doc_count\": 112,\r\n \"categories\": {\r\n \"buckets\": [\r\n {\r\n
\"doc_count\": 73,\r\n \"key\": \"failed to unenroll offline
agents\",\r\n \"regex\":
\".*?failed.+?to.+?unenroll.+?offline.+?agents.*?\",\r\n
\"max_matching_length\": 36\r\n },\r\n {\r\n \"doc_count\": 7,\r\n
\"key\": \"\"\"stderr panic close of closed channel n ngoroutine running
Stop ngithub.com/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5
\\n\\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go
n\r\n```\r\n\r\n\r\n- [x] 7. Number of checkin failure over the past
period of time\r\n\r\nI think this is almost the same as #5. The
difference would be to report\r\nnew failures happened only in the last
hour, or report all agents in\r\nfailure state. (which would be an
increasing number if the agent stays\r\nin failed state).\r\nDo we want
these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<[email protected]>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9"}},"sourceBranch":"main","suggestedTargetBranches":["8.6"],"targetPullRequestStates":[{"branch":"8.6","label":"v8.6.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.7.0","labelRegex":"^v8.7.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/145353","number":145353,"mergeCommit":{"message":"Fleet
Usage telemetry extension (#145353)\n\n## Summary\r\n\r\nCloses
https://github.com/elastic/ingest-dev/issues/1261\r\n\r\nAdded a snippet
to the telemetry that I added for each requirement.\r\nPlease review and
let me know if any changes are needed.\r\nAlso asked a few questions
below. @jlind23 @kpollich \r\n\r\n6. is blocked by
[elasticsearch\r\nchange](elastic/elasticsearch#91701)
to give\r\nkibana_system the missing privilege to read
logs-elastic_agent* indices.\r\n\r\nTook inspiration for task versioning
from\r\nhttps://github.com//pull/144494/files#diff-0c7c49bf5c55c45c19e9c42d5428e99e52c3a39dd6703633f427724d36108186\r\n\r\n-
[x] 1. Elastic Agent versions\r\nVersions of all the Elastic Agent
running: `agent.version` field on\r\n`.fleet-agents`
documents\r\n\r\n```\r\n\"agent_versions\": [\r\n \"8.6.0\"\r\n
],\r\n```\r\n\r\n- [x] 2. Fleet server configuration\r\nThink we can
query for `.fleet-policies` where some `input` has
`type:\r\n'fleet-server'` for this, as well as use the `Fleet Server
Hosts`\r\nsettings that we define via saved objects in
Fleet\r\n\r\n\r\n```\r\n \"fleet_server_config\": {\r\n \"policies\":
[\r\n {\r\n \"input_config\": {\r\n \"server\": {\r\n
\"limits.max_agents\": 10000\r\n },\r\n \"server.runtime\":
\"gc_percent:20\"\r\n }\r\n }\r\n ]\r\n }\r\n```\r\n\r\n- [x] 3. Number
of policies\r\nCount of `.fleet-policies` index \r\n\r\nTo confirm, did
we mean agent policies here?\r\n\r\n```\r\n \"agent_policies\": {\r\n
\"count\": 7,\r\n```\r\n\r\n- [x] 4. Output type contained in those
policies\r\nCollecting this from ts logic, querying from
`.fleet-policies` index.\r\nThe alternative would be to write a painless
script (because the\r\n`outputs` are an object with dynamic keys, we
can't do an aggregation\r\ndirectly).\r\n\r\n```\r\n\"agent_policies\":
{\r\n \"output_types\": [\r\n \"elasticsearch\"\r\n ]\r\n
}\r\n```\r\n\r\nDid we mean to just collect the types here, or any other
info? e.g.\r\noutput urls\r\n\r\n- [x] 5. Average number of checkin
failures\r\nWe only have the most recent checkin status and timestamp
on\r\n`.fleet-agents`.\r\n\r\nDo we mean here to publish the total last
checkin failure count? E.g. 3\r\nif 3 agents are in failure checkin
status currently.\r\nOr do we mean to publish specific info for all
agents\r\n(`last_checkin_status`, `last_checkin` time,
`last_checkin_message`)?\r\nAre the only statuses `error` and `degraded`
that we want to send?\r\n\r\n```\r\n \"agent_last_checkin_status\":
{\r\n \"error\": 0,\r\n \"degraded\": 0\r\n },\r\n```\r\n\r\n- [ ] 6.
Top 3 most common errors in the Elastic Agent logs\r\n\r\nDo we mean
here elastic-agent logs only, or fleet-server logs as well\r\n(maybe
separately)?\r\n\r\nI found an alternative way to query the message
field using sampler and\r\ncategorize text aggregation:\r\n```\r\nGET
logs-elastic_agent*/_search\r\n{\r\n \"size\": 0,\r\n \"query\": {\r\n
\"bool\": {\r\n \"must\": [\r\n {\r\n \"term\": {\r\n \"log.level\":
\"error\"\r\n }\r\n },\r\n {\r\n \"range\": {\r\n \"@timestamp\": {\r\n
\"gte\": \"now-1h\"\r\n }\r\n }\r\n }\r\n ]\r\n }\r\n },\r\n
\"aggregations\": {\r\n \"message_sample\": {\r\n \"sampler\": {\r\n
\"shard_size\": 200\r\n },\r\n \"aggs\": {\r\n \"categories\": {\r\n
\"categorize_text\": {\r\n \"field\": \"message\",\r\n \"size\": 10\r\n
}\r\n }\r\n }\r\n }\r\n }\r\n}\r\n```\r\nExample
response:\r\n```\r\n\"aggregations\": {\r\n \"message_sample\": {\r\n
\"doc_count\": 112,\r\n \"categories\": {\r\n \"buckets\": [\r\n {\r\n
\"doc_count\": 73,\r\n \"key\": \"failed to unenroll offline
agents\",\r\n \"regex\":
\".*?failed.+?to.+?unenroll.+?offline.+?agents.*?\",\r\n
\"max_matching_length\": 36\r\n },\r\n {\r\n \"doc_count\": 7,\r\n
\"key\": \"\"\"stderr panic close of closed channel n ngoroutine running
Stop ngithub.com/elastic/beats/v7/libbeat/cmd/instance Beat launch.func5
\\n\\t/go/src/github.com/elastic/beats/libbeat/cmd/instance/beat.go
n\r\n```\r\n\r\n\r\n- [x] 7. Number of checkin failure over the past
period of time\r\n\r\nI think this is almost the same as #5. The
difference would be to report\r\nnew failures happened only in the last
hour, or report all agents in\r\nfailure state. (which would be an
increasing number if the agent stays\r\nin failed state).\r\nDo we want
these 2 separate telemetry fields?\r\n\r\nEDIT: removed the last1hr
query, instead added a new field to report\r\nagents enrolled per policy
(top 10). See comments below.\r\n\r\n```\r\n \"agent_checkin_status\":
{\r\n \"error\": 3,\r\n \"degraded\": 0\r\n },\r\n
\"agents_per_policy\": [2, 1000],\r\n```\r\n\r\n- [x] 8. Number of
Elastic Agent and number of fleet server\r\n\r\nThis is already there in
the existing telemetry:\r\n```\r\n \"agents\": {\r\n \"total_enrolled\":
0,\r\n \"healthy\": 0,\r\n \"unhealthy\": 0,\r\n \"offline\": 0,\r\n
\"total_all_statuses\": 1,\r\n \"updating\": 0\r\n },\r\n
\"fleet_server\": {\r\n \"total_enrolled\": 0,\r\n \"healthy\": 0,\r\n
\"unhealthy\": 0,\r\n \"offline\": 0,\r\n \"updating\": 0,\r\n
\"total_all_statuses\": 0,\r\n \"num_host_urls\": 1\r\n
},\r\n```\r\n\r\n\r\n\r\n\r\n### Checklist\r\n\r\n- [ ] [Unit or
functional\r\ntests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)\r\nwere
updated or added to match the most common
scenarios\r\n\r\nCo-authored-by: Kibana Machine
<[email protected]>","sha":"e00e26e86854bdbde7c14f88453b717505fed4d9"}}]}]
BACKPORT-->

Co-authored-by: Julia Bardi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:Defend Workflows “EDR Workflows” sub-team of Security Solution Team:Fleet Team label for Observability Data Collection Fleet team v8.6.0
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

8 participants